Abstract: This paper presents a method for reduction of
Markov models with large state spaces based on information-theoretic
criteria. As a cost function we define the Kullback-Leibler
divergence rate between the process obtained by simply
partitioning the alphabet of the original chain (which in general
is not Markov) and its best Markov approximation. We further
show that the Kullback-Leibler divergence rate between the original
chain and the lifting of the optimal Markov approximation
yields an easy-to-compute upper bound on the cost function.

By properly defining the lifting, the present work obtains a 
reduction which is closely related to the notion of lumpability. 
It is further shown that the cost function can be minimized by 
employing the information bottleneck method, thus building a 
bridge between Markov theory, control systems, and machine 
learning. 

Index Terms: Markov chain, lumpability, model reduction,
information bottleneck method
I. Introduction

Markov models are ubiquitous in many scientific disciplines,
from thermodynamics and chemical reaction networks
to the information sciences, where Markov models are used in
speech recognition, learning theory, and as models for data
sources. The popularity of these models is due to the fact that
the Markov property allows both simplified analytic treatment
and efficient simulations of real-world phenomena, which
would otherwise be impossible.

In some disciplines (e.g., in computational biology [1] and
for n-gram word models in speech recognition [2]), however,
the number of states the Markov model can assume (i.e., its
alphabet size) is too large to permit simulation, even when
harnessing today's computing power. It is therefore desirable to
obtain models with fewer states from the original model. One way
to do so is to partition the alphabet of the original
model, where each element of the partition is treated as a state
of the reduced model. Partitioning the alphabet is equivalent
to projecting the Markov process through a non-injective
function.

A function of a Markov process need not possess the
Markov property, cf. [3]; the reduction of the alphabet size
thus comes at the cost of losing the Markov property, which again
prohibits efficient simulation. What one desires is a Markov model
with a smaller alphabet which is "close" (in whatever sense) to the
model obtained by just partitioning the alphabet.

A. Contributions and Related Work 

In this work we attack this problem by introducing a single
information-theoretic cost function, motivated by a recent
work of Deng, Mehta, and Meyn [4]. As in their work, we employ
the Kullback-Leibler divergence rate (KLDR) to measure
the "closeness" between stochastic processes; in particular,
between the projected process obtained by partitioning the
alphabet and a Markov chain over this alphabet of smaller
size. Minimizing the KLDR for a given partition yields the
best Markov approximation of the projected process. Finding
the optimal Markov model thus reduces to a search over all
possible partitions.

Since computing the KLDR between the projected process
and its Markov approximation (the aggregation) can
be cumbersome, we lift the Markov approximation to the
original alphabet; the KLDR can then be evaluated between
two Markov chains, for which simple analytical expressions
exist [4], [5]. Moreover, as we show in this work, the KLDR
between the Markov chains on the original alphabet provides
an upper bound on the KLDR between the projected process
and its aggregation.

The essential difference between [4] and the present
manuscript is the definition of the lifting and its consequences:
While Deng et al. maximize the redundancy and predictability
of the aggregation, the lifting proposed here minimizes information
loss in some well-defined sense.

For Markov chains with strongly interacting groups of states
(which are sometimes called nearly completely decomposable)
the optimal partition of the alphabet is known to be determined
by the sign structure of the Fiedler vector, i.e., the eigenvector
associated with the second largest eigenvalue. Although spectral
graph theory has been employed for model reduction and Markov chain
aggregation for some time (e.g., [6], [7]), [4] first showed a
connection between this eigenvalue-based aggregation method and an
information-theoretic cost function.

By introducing a different lifting, we lose this connection to 
eigenvalue-based aggregation; instead we gain the following: 

1) The lifting proposed in this work minimizes an upper 
bound on the KLDR between the projection and its best 
Markov approximation, subject to the requirement that 
the lifted chain is lumpable (see below for a definition 
of lumpability). 



2) The upper bound we obtain is tight in the special case 
where the original chain is strongly lumpable. 

3) Minimizing the upper bound proposed by our lifting 
minimizes information loss in some well-defined sense. 

4) Moreover, this minimization, loosely speaking, yields
the partition w.r.t. which the original chain is "most
lumpable".

5) A slight relaxation of the cost function allows the
application of a clustering method established in the
machine learning community, the information bottleneck
method [8].

We furthermore provide new insight into some aspects of the
lifting proposed in [4]:

1) Minimizing the KLDR between the original and the
reduced-order model as in [4] and employing lifting to
make this quantity computable is equivalent to minimizing
an upper bound on the KLDR between the projection
and its best Markov approximation.

2) As opposed to our lifting, which minimizes information
loss, the lifting there is shown to maximize the latter by
maximizing predictability.

3) Following the notation in [3], we introduce a compact
matrix notation for the lifting introduced in [4], allowing
us to provide new proofs for some of the results shown
there.

We do not survey other works concerning different heuristics
for Markov chain aggregation; we do, however, want to mention
[9], which applied the information bottleneck method to
partition a graph by assuming continuous-time graph diffusion.

B. Outline of the Paper 

We start by introducing notation and information-theoretic
quantities (Sections II-A and II-B) and their application to
Markov chains (Section II-C) and functions of Markov chains
(Section II-D, introducing also the notion of lumpability). We
then turn to the problem of Markov model order reduction
in Section III. There, we both restate results linked to the
lifting method proposed by [4] (Section IV) and present an
alternative lifting method together with its elementary properties
(Section V). Section VI is then devoted to an analysis
of the information loss induced by the two competing cost
functions; an illustration of how the information bottleneck
method can be employed for model order reduction is given
in Section VII. A few smaller examples are contained in
Section VIII, comparing the two lifting methods in various
scenarios.

II. Notation, Preliminaries, and Setup 

A. Random Variables and Stochastic Processes 

Throughout this work we use the following notation: Let
$(\Omega, \mathfrak{B}, \Pr)$ denote the probability space on which all random
variables (RVs) and stochastic processes are defined. A
random variable $Z$ with finite alphabet $\mathcal{Z}$ assumes values $z$
and induces a probability distribution $P_Z$. Let $p_Z$ denote the
probability mass function (PMF) of $Z$, i.e.,

$$\forall z \in \mathcal{Z}: \quad p_Z(z) := P_Z(\{z\}) = \Pr(Z = z). \tag{1}$$

A discrete-time, one-sided random process $\mathbf{Z}$ is a sequence
of RVs $\{Z_1, Z_2, \dots\}$ defined on a common probability space.
We assume each RV $Z_i$ takes values from the same finite
alphabet $\mathcal{Z}$.

For a finite index set $I \subset \mathbb{N}$, the joint PMF of $Z_I := \{Z_i\}_{i \in I}$
is

$$\forall z_I \in \mathcal{Z}^{\mathrm{card}(I)}: \quad p_{Z_I}(z_I) := \Pr(Z_i = z_i, i \in I). \tag{2}$$

In particular, for $I = \{m, m+1, \dots, n\}$ we abbreviate $Z_m^n :=
\{Z_m, Z_{m+1}, \dots, Z_n\}$ and

$$p_{Z_m^n}(z_m^n) := \Pr(Z_m = z_m, \dots, Z_n = z_n). \tag{3}$$

Along the same lines one obtains the marginal PMFs $p_{Z_n}$
of $Z_n$ and the conditional PMFs $p_{Z_n | Z_1^{n-1}}$ of $Z_n$ given its
immediate past, $Z_1^{n-1}$.

For two probability distributions on the same measurable
space $(\mathcal{Z}, \mathfrak{B}_{\mathcal{Z}})$, we say $P_{Z'}$ is dominated by $P_{Z''}$ (or $P_{Z'}$ is
absolutely continuous w.r.t. $P_{Z''}$), in short, $P_{Z'} \ll P_{Z''}$, iff

$$\forall B \in \mathfrak{B}_{\mathcal{Z}}: \quad P_{Z''}(B) = 0 \Rightarrow P_{Z'}(B) = 0. \tag{4}$$

This naturally carries over to process distributions and, with
some abuse of notation, to PMFs.

The random processes considered in this work are stationary,
i.e., for an arbitrary finite $I$ the corresponding joint PMFs
are shift-invariant:

$$\forall k: \quad p_{Z_I} = p_{Z_{I+k}}, \tag{5}$$

where $I + k := \{i + k : i \in I\}$. In particular, stationarity implies
that the marginal distribution of $Z_k$ is equal for all $k$ and shall
be denoted as $p_Z$.

We believe that under mild conditions all our results can
be generalized to the larger class of asymptotically mean
stationary (AMS) processes. A more detailed discussion of
AMS processes can be found in [10].

B. Information-Theoretic Quantities

Using the above notation, we next introduce a few definitions
from information theory:

Definition 1 (Information-Theoretic Quantities). The (joint)
entropy of a collection of RVs $Z_I$ is given by

$$H(Z_I) := -\sum_{z_I \in \mathcal{Z}^{\mathrm{card}(I)}} p_{Z_I}(z_I) \log p_{Z_I}(z_I). \tag{6}$$

The conditional entropy of the RVs $Z_I$ given the RVs $Z_J$ is the
difference between the joint entropies of $Z_{I \cup J}$ and $Z_J$,

$$H(Z_I | Z_J) := H(Z_{I \cup J}) - H(Z_J) = H(Z_I, Z_J) - H(Z_J) \tag{7}$$

and the mutual information between the RVs $Z_I$ and the RVs
$Z_J$ is defined as

$$I(Z_I; Z_J) := H(Z_I) + H(Z_J) - H(Z_I, Z_J). \tag{8}$$

The entropy rate of a stationary stochastic process $\mathbf{Z}$ is
defined as [11, Thm. 4.2.1]

$$\bar{H}(\mathbf{Z}) := \lim_{n \to \infty} \frac{1}{n} H(Z_1^n) = \lim_{n \to \infty} H(Z_n | Z_1^{n-1}). \tag{9}$$

The redundancy rate of a stationary stochastic process $\mathbf{Z}$ is
defined as

$$\bar{R}(\mathbf{Z}) := H(Z) - \bar{H}(\mathbf{Z}) \overset{(a)}{\ge} 0 \tag{10}$$

where $H(Z)$ is the entropy of the marginal distribution of $\mathbf{Z}$
and where (a) is due to the fact that conditioning reduces
entropy [11, Thm. 2.6.5].

We note in passing that the entropy rate of an AMS process
is defined similarly (cf. [12, Thm. 4.1]).

The redundancy rate is a measure of statistical dependence
between the current sample and its past: For an iid process,
$\bar{H}(\mathbf{Z}) = H(Z)$ and $\bar{R}(\mathbf{Z}) = 0$. Conversely, for a completely
predictable process, $\bar{H}(\mathbf{Z}) = 0$ and $\bar{R}(\mathbf{Z}) = H(Z)$. In other
words, the higher the redundancy rate, the lower the entropy
rate and, thus, the less information is conveyed by the process
in each time step.

We need another definition for the development of our
results:

Definition 2 (Kullback-Leibler Divergence Rate). The
Kullback-Leibler divergence rate (KLDR) between two stationary
stochastic processes $\mathbf{Z}$ and $\mathbf{Z}'$ on the same finite
alphabet $\mathcal{Z}$ is [12, Ch. 10]

$$\bar{D}(\mathbf{Z} \| \mathbf{Z}') := \lim_{n \to \infty} \frac{1}{n} D(Z_1^n \| Z_1'^n) = \lim_{n \to \infty} \frac{1}{n} \sum_{z_1^n \in \mathcal{Z}^n} p_{Z_1^n}(z_1^n) \log \frac{p_{Z_1^n}(z_1^n)}{p_{Z_1'^n}(z_1^n)} \tag{11}$$

whenever the limit exists and if $p_{Z_1'^n} \gg p_{Z_1^n}$ for all $n$.

The limit exists, e.g., between a stationary stochastic process
and a time-homogeneous Markov chain [12] as well as
between Markov chains (not necessarily stationary or irreducible) [5].

C. Markov Chains 

Let $\mathbf{X}$ be a regular, i.e., irreducible and aperiodic,
time-homogeneous Markov chain on the finite alphabet $\mathcal{X} =
\{1, \dots, N\}$ (see [3] for terminology). Its behavior is uniquely
determined by its transition matrix $P = \{P_{ij}\}$, where $P_{ij} :=
\Pr(X_n = j | X_{n-1} = i)$. The unique invariant distribution
vector $\mu$ with its $i$-th component given by

$$\mu_i := p_X(i) = \Pr(X_k = i) > 0 \tag{12}$$

satisfies $\mu^T = \mu^T P$ [3, Thm. 4.1.6]. We assume that $\mathbf{X}$
is stationary, i.e., its initial distribution coincides with the
invariant distribution:

$$\forall i \in \mathcal{X}: \quad \Pr(X_1 = i) = \mu_i. \tag{13}$$

We therefore use the shorthand notation $\mathbf{X} \sim \mathrm{Mar}(\mathcal{X}, P, \mu)$.

We note in passing that regular, time-homogeneous Markov
chains are AMS, even if their starting distribution is different
from $\mu$ and they are, thus, non-stationary [13]. Under mild
conditions, the results presented below can be assumed to carry
over to this more general case.

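With Definition 1, the entropy and the entropy rate of $\mathbf{X}$,
as well as the KLDR between two Markov chains $\mathbf{X}$ and $\mathbf{X}'$
on the same alphabet $\mathcal{X}$ with transition matrices $P$ and $P'$
are [11, p. 77], [5]

$$H(X) = -\sum_{i \in \mathcal{X}} \mu_i \log \mu_i \tag{14}$$

$$\bar{H}(\mathbf{X}) = H(X_1 | X_0) = -\sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log P_{ij} \tag{15}$$

$$\bar{D}(\mathbf{X} \| \mathbf{X}') = \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{P_{ij}}{P'_{ij}} \tag{16}$$

respectively, provided that $P'_{ij} = 0$ implies $P_{ij} = 0$ ($P \ll P'$).

^1 To be precise, the processes $\mathbf{Z}$ and $\mathbf{Z}'$ of Definition 2 are random variables on the same measurable
space $(\mathcal{Z}, \mathfrak{P}(\mathcal{Z}))$, but with different process probability measures.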
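The closed-form expressions for the entropy, the entropy rate, and the KLDR between two Markov chains are straightforward to evaluate numerically. The following is a minimal sketch, not taken from the paper: the function names, the two-state example chain, the use of base-2 logarithms, and the power iteration used to obtain the invariant distribution are all illustrative assumptions.

```python
import math

def stationary(P, iters=2000):
    """Invariant distribution mu (mu^T = mu^T P) by power iteration.
    Assumption: P is row-stochastic and regular, so the iteration converges."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def entropy(mu):
    """Marginal entropy in bits, cf. (14)."""
    return -sum(m * math.log2(m) for m in mu if m > 0)

def entropy_rate(P, mu):
    """Entropy rate of a stationary Markov chain in bits, cf. (15)."""
    return -sum(mu[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0)

def kldr(P, P_prime, mu):
    """KLDR between two Markov chains in bits, cf. (16).
    Returns infinity if P'_ij = 0 while P_ij > 0 (domination violated)."""
    total = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            if p > 0:
                if P_prime[i][j] == 0:
                    return math.inf
                total += mu[i] * p * math.log2(p / P_prime[i][j])
    return total

# Hypothetical two-state example chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]
mu = stationary(P)                      # approximately (2/3, 1/3)
P_iid = [list(mu), list(mu)]            # chain whose samples are iid with PMF mu
redundancy = entropy(mu) - entropy_rate(P, mu)
```

In this example the KLDR between the chain and the iid process with the same marginal distribution coincides with the redundancy rate, which follows directly from (14) to (16).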

D. Functions of Markov Chains 

We partition the alphabet $\mathcal{X}$ of the Markov chain $\mathbf{X}$ by
a surjective function $g: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \{1, \dots, M\}$.
Projecting $\mathbf{X}$ through the function, i.e., $Y_n := g(X_n)$, defines
another stochastic process $\mathbf{Y}$; in what follows we call this
process the projected process, or simply the projection of $\mathbf{X}$
(see Fig. 1).

Clearly, if $\mathbf{X}$ is stationary, then so is $\mathbf{Y}$. The following
inequalities

$$H(Y) \le H(X) \tag{17}$$

$$\bar{H}(\mathbf{Y}) \le \bar{H}(\mathbf{X}) \tag{18}$$

$$\bar{R}(\mathbf{Y}) \le \bar{R}(\mathbf{X}) \tag{19}$$

can be shown to hold by the data processing inequality [11],
[14] and by [15].

It is well known that $\mathbf{Y}$ is not necessarily Markov anymore.
The case where $\mathbf{Y}$ is a regular, time-homogeneous Markov
chain gave rise to the notion of lumpability:

Definition 3 (Lumpability [3]). A stationary Markov chain
$\mathbf{X} \sim \mathrm{Mar}(\mathcal{X}, P, \mu)$ is lumpable w.r.t. a function $g$, iff the
process $\mathbf{Y}$ is a regular, time-homogeneous Markov chain with
alphabet $\mathcal{Y}$, transition matrix $Q$, and invariant distribution $\nu$,
i.e., iff $\mathbf{Y} \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$.

Let $V$ be an $N \times M$ matrix with $V_{ij} = 1$ if $i \in g^{-1}(j)$
and zero otherwise (thus, every row contains exactly one 1).
Furthermore, $U^{\pi}$ is an $M \times N$ matrix with zeros in the same
positions as $V^T$, but with non-negative row entries which sum
up to one. In other words, with $\pi$ being a probability vector,

$$U^{\pi}_{ij} := \frac{\pi_j}{\sum_{k \in g^{-1}(i)} \pi_k} \tag{20}$$

if $j \in g^{-1}(i)$ and zero otherwise. If the superscript is omitted,
we let $\pi$ be arbitrary. With these definitions in mind we present

Lemma 1 (Conditions for Lumpability). A stationary Markov
chain $\mathbf{X} \sim \mathrm{Mar}(\mathcal{X}, P, \mu)$ is lumpable w.r.t. $g$ if either

$$V U P V = P V \tag{21}$$

or

$$U^{\mu} P V U^{\mu} = U^{\mu} P. \tag{22}$$

In both cases, $\mathbf{Y} \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$ with $\nu^T = \mu^T V$ and

$$Q = U^{\mu} P V. \tag{23}$$
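The matrices $V$ and $U^{\pi}$ and the first lumpability condition (21) can be sketched in a few lines. This is only an illustration under stated assumptions: states and classes are indexed from 0 rather than from 1, the partition function is given as a list `g`, the helper names are hypothetical, and the three-state matrix is an invented example in which the last two states can be merged.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def partition_matrices(g, M, pi):
    """Return V (N x M) and U^pi (M x N) for the partition function g.
    g[i] in {0, ..., M-1} is the class of state i (0-based, an assumption)."""
    N = len(g)
    V = [[1.0 if g[i] == k else 0.0 for k in range(M)] for i in range(N)]
    mass = [sum(pi[i] for i in range(N) if g[i] == k) for k in range(M)]
    U = [[pi[j] / mass[k] if g[j] == k else 0.0 for j in range(N)] for k in range(M)]
    return V, U

def satisfies_condition_21(P, g, M, tol=1e-12):
    """Check V U P V = P V, cf. (21), with U built from a uniform pi."""
    N = len(P)
    V, U = partition_matrices(g, M, [1.0 / N] * N)
    PV = matmul(P, V)
    VUPV = matmul(V, matmul(U, PV))
    return all(abs(a - b) <= tol for ra, rb in zip(VUPV, PV) for a, b in zip(ra, rb))

# Hypothetical example: rows 1 and 2 have equal probabilities of entering
# the classes {0} and {1, 2}, so merging states 1 and 2 satisfies (21).
P = [[0.5, 0.25, 0.25],
     [0.3, 0.30, 0.40],
     [0.3, 0.60, 0.10]]
check_a = satisfies_condition_21(P, [0, 1, 1], 2)
check_b = satisfies_condition_21(P, [0, 0, 1], 2)
V, U = partition_matrices([0, 1, 1], 2, [0.2, 0.3, 0.5])
UV = matmul(U, V)   # U V is the M x M identity matrix
```

Condition (21) amounts to requiring that all rows of $PV$ belonging to the same class are identical, which is what the comparison of $VUPV$ with $PV$ tests.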



Remark 1. Since this paper focuses on $\mathbf{X}$ being stationary
(i.e., its initial distribution equals the invariant distribution)
we also defined lumpability accordingly. In the literature, two
different notions of lumpability are defined: strong lumpability,
where $\mathbf{Y} \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$ for any initial distribution of $\mathbf{X}$
and where $Q$ is independent of this initial distribution (cf. [3,
Def. 6.3.1]), and weak lumpability, where at least one initial
distribution of $\mathbf{X}$ leads to $\mathbf{Y} \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$. A chain which
is weakly lumpable is also lumpable if its initial distribution
is the invariant distribution (cf. [3, Thm. 6.4.3]).



Proof: For the first condition, the condition for strong
lumpability in the sense of [3], see [3, Thm. 6.3.5 &
Example 6.3.3].

For the second condition we assume that

$$U^{\mu} P V U^{\mu} = U^{\mu} P \tag{24}$$

is fulfilled. By [3, Thm. 6.4.4] it follows that the Markov chain
$\mathbf{X}$ with transition matrix $P$ is weakly lumpable w.r.t. $g$ and
that thus, if the initial distribution among the states equals the
invariant distribution, the projected process $\mathbf{Y}$ is a Markov
chain with transition matrix [3, Thm. 6.4.3]

$$Q = U^{\mu} P V. \tag{25}$$

Doing a little algebra one can easily show that

$$\nu^T = \nu^T Q = \nu^T U^{\mu} P V = \mu^T P V = \mu^T V \tag{26}$$

establishing the result about the invariant distribution. ■

The corresponding result for continuous-time Markov
chains on a countable alphabet has been proven in [16,
Thm. 2]. In what follows, we will utilize

Lemma 2. Let $\mathbf{X}$ and $\mathbf{X}'$ be stationary, time-homogeneous,
regular Markov chains with transition matrices $P$ and $P'$ on
the same alphabet $\mathcal{X}$. Let $P' \gg P$. We define two processes
$\mathbf{Y}$ and $\mathbf{Y}'$ by $Y_n := g(X_n)$ and $Y'_n := g(X'_n)$, $g: \mathcal{X} \to \mathcal{Y}$. Let
additionally $\mathbf{X}'$ be lumpable w.r.t. $g$. We have

$$\bar{D}(\mathbf{X} \| \mathbf{X}') \ge \bar{D}(\mathbf{Y} \| \mathbf{Y}'). \tag{27}$$



Proof: Clearly, all involved processes are stationary. Since
$P' \gg P$, $\bar{D}(\mathbf{X} \| \mathbf{X}')$ exists and equals (16). Since $\mathbf{X}'$ is
lumpable, $\mathbf{Y}'$ is a regular, time-homogeneous Markov chain.
Moreover, from $P' \gg P$ and, thus, $P_{\mathbf{X}'} \gg P_{\mathbf{X}}$, it follows
that $P_{\mathbf{Y}'} \gg P_{\mathbf{Y}}$. This ensures the existence of $\bar{D}(\mathbf{Y} \| \mathbf{Y}')$ [12,
Lem. 10.3].

Finally, since the Kullback-Leibler divergence reduces under
measurements (e.g., [12, Cor. 3.3] or [14, Ch. 2.4]), i.e.,
for all $n$,

$$D(X_1^n \| X_1'^n) \ge D(Y_1^n \| Y_1'^n), \tag{28}$$

we obtain

$$\lim_{n \to \infty} \frac{1}{n} D(X_1^n \| X_1'^n) \ge \lim_{n \to \infty} \frac{1}{n} D(Y_1^n \| Y_1'^n). \tag{29}$$

This completes the proof. ■



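[Figure 1: diagram. Top row: X ~ (P, μ) and X' ~ (P', μ') on the original alphabet, compared via the KLDR between X and X'. Bottom row: the projection Y and the aggregation Y' ~ (Q, ν) on the reduced alphabet, compared via the KLDR between Y and Y'. The rows are linked by projection through g and by lifting.]

Fig. 1. Illustration of the problem: Assume a Markov chain $\mathbf{X}$ is given.
We are interested in finding an aggregation of $\mathbf{X}$, i.e., a Markov chain $\mathbf{Y}'$
on a partition of the alphabet of $\mathbf{X}$. This partition defines a function $g$ (and
vice versa), which allows us to define a process $\mathbf{Y}$ (via $Y_n := g(X_n)$), the
projection of $\mathbf{X}$. Note that $\mathbf{Y}$ need not be Markov. Lifting $\mathbf{Y}'$ yields a Markov
chain on the original alphabet, which can be projected to $\mathbf{Y}'$ using the function
$g$.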



III. Problem Statement 

Throughout the remainder of this work we will stick to the
following

Assumption 1. Let $\mathbf{X}$ be a stationary, discrete-time, irreducible
and aperiodic, time-homogeneous Markov chain over
the alphabet $\mathcal{X} = \{1, \dots, N\}$, with transition matrix $P$ and
an initial distribution equal to its invariant distribution $\mu$ (i.e.,
$\mathbf{X} \sim \mathrm{Mar}(\mathcal{X}, P, \mu)$).

Let further $g: \mathcal{X} \to \mathcal{Y} = \{1, \dots, M\}$ be a surjective
function onto a set of cardinality $\mathrm{card}(\mathcal{Y}) = M$ strictly smaller
than the cardinality $\mathrm{card}(\mathcal{X}) = N$ of its domain $\mathcal{X}$. In other
words, $g$ induces a partition of $\mathcal{X}$ by the preimages^2 of the
elements of $\mathcal{Y}$.

In addition, for a given partition function $g: \mathcal{X} \to \mathcal{Y}$, we
adopt the notation implicit in Lemma 2:

• $\mathbf{Y}$, for stationary processes over the alphabet $\mathcal{Y}$,

• $\mathbf{Y}'$, for Markov processes over the alphabet $\mathcal{Y}$, and

• $\mathbf{X}'$, for Markov processes over the alphabet $\mathcal{X}$ which are
lumpable w.r.t. $g$.

We need to distinguish a specific stationary process over the
alphabet $\mathcal{Y}$, namely, the one generated by passing $\mathbf{X}$ through
the partition function $g$:

Definition 4 (g-projection). Given $\mathbf{X}$ and $g$ as in Assumption 1,
the g-projection of $\mathbf{X}$ is a process $\mathbf{Y}_g$ over the alphabet
$\mathcal{Y}$, whose samples are defined by

$$Y_{g,n} := g(X_n). \tag{30}$$



We are interested in performing model reduction by employing
information-theoretic cost functions. In particular, we
specify the M-partition problem, equivalently defined in [4]:

Definition 5 (M-partition problem). Given $\mathbf{X}$ and $g$ as in
Assumption 1, the M-partition problem searches for the partition
function $g$ such that the KLDR between the g-projection of $\mathbf{X}$
and its best Markov approximation is minimal, i.e., it solves

$$\arg\min_{g} \min_{\mathbf{Y}'} \{ \bar{D}(\mathbf{Y}_g \| \mathbf{Y}') \mid \mathbf{Y}' \text{ is Markov} \}. \tag{31}$$

^2 Given a state $j \in \mathcal{Y}$, with slight abuse of notation we write $g^{-1}(j)$ for the
preimage of $j$ under $g$, that is, $g^{-1}(j) := g^{-1}(\{j\}) = \{i \in \mathcal{X} \mid g(i) = j\}$.

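For a fixed partition function $g$, this optimal $\mathbf{Y}'$, i.e., the
"best" Markov approximation (in the sense of the KLDR) of
the g-projection $\mathbf{Y}_g$, can be found analytically. With the matrix
notation introduced in Section II-D we present

Lemma 3. Given $\mathbf{X}$ and $g$ as in Assumption 1, denote by
$\mathbf{Y}'_g$ the best Markov approximation of the g-projection in the
sense of the KLDR, i.e.,

$$\mathbf{Y}'_g = \arg\min_{\mathbf{Y}'} \bar{D}(\mathbf{Y}_g \| \mathbf{Y}'). \tag{32}$$

Then, $\mathbf{Y}'_g \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$ with $\nu^T = \mu^T V$ and

$$Q = U^{\mu} P V, \tag{33}$$

which is a matrix notation for

$$Q_{kl} = \frac{\sum_{i \in g^{-1}(k)} \sum_{j \in g^{-1}(l)} \mu_i P_{ij}}{\sum_{i \in g^{-1}(k)} \mu_i}, \quad k, l \in \mathcal{Y}. \tag{34}$$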



From now on, we keep the notation $\mathbf{Y}'_g$ for the optimal
aggregation (Fig. 1) of $\mathbf{X}$, given a partition function $g$.

Proof: See [12, Cor. 7.4.2] or [4, Thm. 1] and the
references therein. ■
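The element-wise form (34) of the optimal aggregation can be evaluated directly from $P$, $\mu$, and $g$. The sketch below is an illustration under assumptions: states and classes are indexed from 0, the invariant distribution is approximated by power iteration, and the function names and the three-state example matrix are hypothetical.

```python
def stationary(P, iters=2000):
    """Invariant distribution by power iteration (assumes a regular chain)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def optimal_aggregation(P, g, M):
    """Return (Q, nu) with Q as in (34) and nu^T = mu^T V.
    g[i] in {0, ..., M-1} is the class of state i (0-based, an assumption)."""
    N = len(P)
    mu = stationary(P)
    nu = [0.0] * M
    flow = [[0.0] * M for _ in range(M)]   # joint PMF of two consecutive projected samples
    for i in range(N):
        nu[g[i]] += mu[i]
        for j in range(N):
            flow[g[i]][g[j]] += mu[i] * P[i][j]
    Q = [[flow[k][l] / nu[k] for l in range(M)] for k in range(M)]
    return Q, nu

# Hypothetical example chain; merging states 1 and 2 is a lumpable partition.
P = [[0.5, 0.25, 0.25],
     [0.3, 0.30, 0.40],
     [0.3, 0.60, 0.10]]
Q, nu = optimal_aggregation(P, [0, 0, 1], 2)
Q_lump, nu_lump = optimal_aggregation(P, [0, 1, 1], 2)
```

The intermediate array `flow` makes explicit that $Q$ depends only on the joint distribution of two consecutive samples of the projected process.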

Remark 2. The same aggregation was declared to be optimal
in [17], although by using a different cost function.

One thus obtains the desired transition matrix $Q$ of the
optimal Markov model $\mathbf{Y}'_g$ from the joint distribution of two
consecutive samples of $\mathbf{Y}_g$. If this joint distribution completely
specifies the process $\mathbf{Y}_g$, then $\mathbf{Y}_g \equiv \mathbf{Y}'_g$, i.e., $\mathbf{X}$ is lumpable
(cf. Lemma 1). Note further that since $P$ is the transition
matrix of a regular Markov chain, so is $Q$ [3, p. 140].

We can now define the aggregation error of $\mathbf{X}$ w.r.t. $g$:

Definition 6 (Aggregation error). Given $\mathbf{X}$ and $g$ as in
Assumption 1, let $\mathbf{Y}'_g$ be the optimal aggregation w.r.t. $g$, as
introduced in Lemma 3. Then,

$$\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g) \tag{35}$$

defines the aggregation error of $\mathbf{X}$ w.r.t. $g$.

Clearly, for a lumpable Markov chain $\mathbf{X}$ the aggregation
error will be zero.

Following [4], we split the M-partition problem into two
sub-problems: finding the best Markov approximation of a
projected process $\mathbf{Y}_g$ (to which Lemma 3 provides the solution),
and minimizing the aggregation error over all partition
functions $g$ with a range of cardinality $M$. Thus, the optimization
problem stated in (31) translates to finding the minimizer
among all possible partitions:

$$\arg\min_{g} \bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g). \tag{36}$$



IV. π-lifting: Bounding the Aggregation Error

Often, a direct evaluation of the aggregation error in Definition 6
is mathematically cumbersome, since $\mathbf{Y}_g$ is not
necessarily Markov. The authors of [4] therefore suggested to
lift the aggregation $\mathbf{Y}'_g$ to a Markov chain $\mathbf{X}'$ over the alphabet
$\mathcal{X}$, which subsequently allows a computation of the KLDR.

The question is now whether there is a relation between the
KLDR between $\mathbf{X}$ and the lifted chain $\mathbf{X}'$, and the aggregation
error, $\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g)$. Relying on Lemma 2 we will answer this
question affirmatively, essentially bounding the aggregation
error from above.



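Definition 7 (π-lifting). Let $\mathbf{X}$ and $g$ be as in Assumption 1,
$\mathbf{Y}'_g \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$ as in Lemma 3, and $\pi$ be a probability
distribution over the alphabet $\mathcal{X}$. The π-lifting of $\mathbf{Y}'_g$ w.r.t. $g$,
denoted by $\mathbf{X}'^{\pi}_g$, is a Markov chain over the alphabet $\mathcal{X}$ with
transition matrix

$$P' = V Q U^{\pi} \tag{37}$$

which is a matrix notation for

$$P'_{ij} = \frac{\pi_j}{\sum_{k \in g^{-1}(g(j))} \pi_k} Q_{g(i)g(j)}, \quad i, j \in \mathcal{X}. \tag{38}$$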
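The element-wise form (38) of the π-lifting translates almost literally into code. The following is a minimal sketch assuming 0-based indices; the function name, the aggregated matrix `Q`, the partition `g`, and the distribution `pi` are invented for illustration only.

```python
def pi_lifting(Q, g, pi):
    """Transition matrix P' of the pi-lifting, cf. (38).
    g[i] is the class of state i (0-based, an assumption); pi must be positive."""
    N = len(g)
    mass = {}
    for j in range(N):
        mass[g[j]] = mass.get(g[j], 0.0) + pi[j]   # pi-mass of each class
    return [[pi[j] / mass[g[j]] * Q[g[i]][g[j]] for j in range(N)]
            for i in range(N)]

# Hypothetical inputs: a 2-state aggregation lifted to 3 states.
Q = [[0.5, 0.5],
     [0.3, 0.7]]
g = [0, 1, 1]
pi = [0.2, 0.3, 0.5]
P_lift = pi_lifting(Q, g, pi)
```

Rows of `P_lift` belonging to states of the same class are identical, so the lifted chain carries no information about transitions between individual states of the original chain beyond what is contained in $Q$ and $\pi$.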



Remark 3. An equivalent lifting method is suggested in [18],
[19], and [17].

We conclude this section by presenting the elementary
properties of π-lifting. Properties 1), 2), and 3) appear also
in [4]; the proofs can be found in Appendix A.

Proposition 1 (Properties of π-lifting). Given $\mathbf{X}$ and $g$ as
in Assumption 1, $\mathbf{Y}'_g$ the optimal aggregation as defined in
Lemma 3, and $\pi$ some distribution over $\mathcal{X}$. Then, the π-lifting
$\mathbf{X}'^{\pi}_g$ satisfies

1) $\mathbf{X}'^{\pi}_g$ is lumpable w.r.t. $g$ (and $\mathbf{Y}'_g$ is the resulting
g-projection);

2) The invariant distribution of $\mathbf{X}'^{\mu}_g$ is $\mu$;

3) $\mu = \arg\min_{\pi} \bar{D}(\mathbf{X} \| \mathbf{X}'^{\pi}_g)$;

4) For positive $\pi$, $P' \gg P$;

5) $\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g) \le \bar{D}(\mathbf{X} \| \mathbf{X}'^{\pi}_g)$.

V. A Better Bound via P-lifting

In Section IV we showed that the KLDR between $\mathbf{X}$ and
the μ-lifting^3 $\mathbf{X}'^{\mu}_g$ provides an upper bound on the aggregation
error for a given partition function $g$. Unfortunately, the bound
is loose in the sense that for $\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g) = 0$, we may have
$\bar{D}(\mathbf{X} \| \mathbf{X}'^{\mu}_g) > 0$; see also [19]. One of the reasons for this
disadvantage of π-lifting is that, by construction, the lifted
process $\mathbf{X}'^{\pi}_g$ does not contain information about the transition
probabilities between states of $\mathbf{X}$. We therefore propose a
lifting which takes into account the transition matrix $P$ of
the original process.

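Definition 8 (P-lifting). Let $\mathbf{X}$ and $g$ be as in Assumption 1
and $\mathbf{Y}'_g \sim \mathrm{Mar}(\mathcal{Y}, Q, \nu)$ the optimal aggregation as defined
in Lemma 3. The P-lifting of $\mathbf{Y}'_g$ w.r.t. $g$, denoted by $\hat{\mathbf{X}}^P_g$, is
a Markov chain over the alphabet $\mathcal{X}$ with a transition matrix
$\hat{P}$ given as

$$\hat{P}_{ij} = \begin{cases} \dfrac{P_{ij}}{\sum_{k \in S_j} P_{ik}} Q_{g(i)g(j)}, & \text{if } \sum_{k \in S_j} P_{ik} > 0 \\[2ex] \dfrac{1}{\mathrm{card}(S_j)} Q_{g(i)g(j)}, & \text{if } \sum_{k \in S_j} P_{ik} = 0 \end{cases} \tag{39}$$

where $S_j := g^{-1}(g(j))$ denotes the set of states mapped to the same element as $j$.

^3 I.e., the π-lifting obtained by choosing $\pi$ to be the invariant distribution $\mu$
of $\mathbf{X}$.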
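A P-lifting along the lines of Definition 8 can be sketched as follows. This is an illustration, not the authors' implementation: indices are 0-based, the invariant distribution comes from power iteration, the helper names are hypothetical, and spreading $Q_{g(i)g(j)}$ uniformly over a class that state $i$ cannot reach is treated here as an assumption about the second case of (39).

```python
def stationary(P, iters=2000):
    """Invariant distribution by power iteration (assumes a regular chain)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def aggregate(P, g, M):
    """Optimal aggregation Q, cf. (34)."""
    mu = stationary(P)
    N = len(P)
    nu = [sum(mu[i] for i in range(N) if g[i] == k) for k in range(M)]
    return [[sum(mu[i] * P[i][j] for i in range(N) for j in range(N)
                 if g[i] == k and g[j] == l) / nu[k]
             for l in range(M)] for k in range(M)]

def p_lifting(P, g, M):
    """Return (P_hat, Q), where P_hat follows (39)."""
    N = len(P)
    Q = aggregate(P, g, M)
    size = [g.count(k) for k in range(M)]
    P_hat = [[0.0] * N for _ in range(N)]
    for i in range(N):
        # probability of jumping from state i into each class
        into = [sum(P[i][k] for k in range(N) if g[k] == l) for l in range(M)]
        for j in range(N):
            q = Q[g[i]][g[j]]
            if into[g[j]] > 0:
                P_hat[i][j] = P[i][j] / into[g[j]] * q
            else:
                P_hat[i][j] = q / size[g[j]]   # assumed uniform split
    return P_hat, Q

# Hypothetical example; merging states 1 and 2 is a strongly lumpable partition.
P = [[0.5, 0.25, 0.25],
     [0.3, 0.30, 0.40],
     [0.3, 0.60, 0.10]]
P_hat, Q = p_lifting(P, [0, 1, 1], 2)      # reproduces P in this case
P_hat2, Q2 = p_lifting(P, [0, 0, 1], 2)    # differs from P, but lumps to Q2
```

In both cases the class-wise row sums of the lifted matrix reproduce the corresponding row of $Q$, which is what makes the lifted chain lumpable w.r.t. $g$.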



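One of the main contributions of this paper is to show that
the KLDR between $\mathbf{X}$ and the P-lifting $\hat{\mathbf{X}}^P_g$ yields a better
bound than what would be obtained using the π-lifting. In
Appendix B we prove

Theorem 1 (Properties of P-lifting). Let $\mathbf{X}$ and $g$ be as in
Assumption 1 and $\mathbf{Y}'_g$ the optimal aggregation as defined in
Lemma 3. Then, the P-lifting $\hat{\mathbf{X}}^P_g$ satisfies

1) $\hat{\mathbf{X}}^P_g$ is lumpable w.r.t. $g$ (and $\mathbf{Y}'_g$ is the resulting
g-projection);

2) $\hat{P} \gg P$;

3) (minimizer)

$$\hat{\mathbf{X}}^P_g = \arg\min_{\tilde{\mathbf{X}}:\ \mathbf{Y}'_g \text{ is the } g\text{-projection of } \tilde{\mathbf{X}}} \bar{D}(\mathbf{X} \| \tilde{\mathbf{X}}) \tag{40}$$

4) (tighter bounds than π-lifting)

$$\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g) \le \bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g) \le \bar{D}(\mathbf{X} \| \mathbf{X}'^{\pi}_g) \tag{41}$$

5) (tight bounds) If $\mathbf{X}$ is strongly lumpable, i.e., satisfies (21),
then

$$\bar{D}(\mathbf{Y}_g \| \mathbf{Y}'_g) = 0 \Leftrightarrow \bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g) = 0. \tag{42}$$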



As shown in the proof, the tightness result follows from
the fact that the P-lifting then yields $\hat{P} = P$; as a direct
consequence, in this case also the invariant distribution of
$\hat{P}$ trivially coincides with $\mu$, the invariant distribution of $P$.
In general, however, the invariant distribution of $\hat{P}$ differs
from $\mu$, contrasting the corresponding result for the π-lifting
(cf. Proposition 1, property 2).

Interestingly, the restriction to strongly lumpable chains
for the tightness result cannot be dropped either: There are
Markov chains $\mathbf{X}$ which are lumpable in the weak sense
(i.e., not for all initial distributions but, e.g., for the invariant
distribution) for which consequently the aggregation error
vanishes, but for which the P-lifting does not yield $\hat{P} = P$.
A simple example for a transition matrix of such a chain is
given in [3, p. 139] (cf. Section VIII-D).

As this theorem shows, P-lifting yields the best upper bound
on the aggregation error achievable for Markov chains over
the alphabet $\mathcal{X}$. This can also be explained intuitively, by
expanding the KLDR as

$$\bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g) = \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{P_{ij}}{\hat{P}_{ij}} \tag{43}$$

$$= \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{\sum_{k \in S_j} P_{ik}}{Q_{g(i)g(j)}} \tag{44}$$

$$\overset{(a)}{=} \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{\sum_{k \in S_j} P_{ik} \sum_{k \in S_i} \mu_k}{\sum_{k \in S_i} \mu_k \sum_{l \in S_j} P_{kl}} \tag{45}$$

$$= H(Y_{g,n} | Y_{g,n-1}) - H(Y_{g,n} | X_{n-1}) \tag{46}$$

where (a) is due to Lemma 3. Note that the last line corresponds
to the difference between the upper and lower bounds
on the entropy rate of a function of a Markov chain [11,
Thm. 4.5.1]; equality of these bounds implies Markovity of
$\mathbf{Y}_g$, i.e., lumpability of $\mathbf{X}$ w.r.t. $g$. In other words, minimizing
this cost function yields the function $g$ for which the projected
process $\mathbf{Y}_g$ is "as Markov as possible".
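Because the cost reduces to a difference of two conditional entropies, it can be evaluated without constructing the lifted chain at all. The sketch below illustrates this under assumptions: indices are 0-based, logarithms are taken to base 2, the invariant distribution is approximated by power iteration, and the function names and the example chain are hypothetical.

```python
import math

def stationary(P, iters=2000):
    """Invariant distribution by power iteration (assumes a regular chain)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def p_lifting_cost(P, g, M):
    """H(Y_n | Y_{n-1}) - H(Y_n | X_{n-1}) in bits, cf. (46)."""
    N = len(P)
    mu = stationary(P)
    # W[i][l] = Pr(Y_n = l | X_{n-1} = i)
    W = [[sum(P[i][j] for j in range(N) if g[j] == l) for l in range(M)]
         for i in range(N)]
    nu = [sum(mu[i] for i in range(N) if g[i] == k) for k in range(M)]
    # Q[k][l] = Pr(Y_n = l | Y_{n-1} = k), the optimal aggregation
    Q = [[sum(mu[i] * W[i][l] for i in range(N) if g[i] == k) / nu[k]
          for l in range(M)] for k in range(M)]
    h_given_x = -sum(mu[i] * w * math.log2(w)
                     for i in range(N) for w in W[i] if w > 0)
    h_given_y = -sum(nu[k] * q * math.log2(q)
                     for k in range(M) for q in Q[k] if q > 0)
    return h_given_y - h_given_x

# Hypothetical example; merging states 1 and 2 is a lumpable partition.
P = [[0.5, 0.25, 0.25],
     [0.3, 0.30, 0.40],
     [0.3, 0.60, 0.10]]
cost_lumpable = p_lifting_cost(P, [0, 1, 1], 2)   # vanishes up to rounding
cost_other = p_lifting_cost(P, [0, 0, 1], 2)      # strictly positive
```

The cost vanishes for the lumpable partition and is positive for the other one, in line with the interpretation of the cost as a measure of how far the projected process is from being Markov.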

In contrast, the π-lifting tries to maximize the redundancy
rate of the projected process $\mathbf{Y}_g$ (see Section VI or [4, Sec. III]
for a detailed analysis). This admits, at least for the bi-partition
problem ($M = 2$) and under specific conditions on
the transition matrix of $\mathbf{X}$, an interpretation in terms of the
spectral theory of Markov chains. Whether the P-lifting also
admits such an interpretation is within the scope of future
work.

VI. Model Reduction and Information Loss 

We now analyze how the KLDR between the original
process and a lifted process (be it either π- or P-lifting)
connects with the information loss induced by the projection
function $g$. In doing so, one may ask if the model reduction
problem can be solved using information-theoretic algorithms,
such as the information bottleneck method. We show in the
following section that this question can be answered affirmatively.

Since for the π-lifting we will exclusively use the minimizer
$\pi = \mu$ (cf. Proposition 1), in this section we will also refer
to it as μ-lifting.

A. μ-lifting and Information Loss

We start with analyzing the μ-lifting suggested by [4].
Recall that the goal is to find a partition of the original
alphabet $\mathcal{X}$ (induced by a function $g$) such that the KLDR
is minimized:

$$g^{\circ} := \arg\min_{g} \bar{D}(\mathbf{X} \| \mathbf{X}'^{\mu}_g) \overset{(a)}{=} \arg\max_{g} \bar{R}(\mathbf{Y}'_g) \tag{47}$$

where (a) is due to [4, Lem. 3]. Thus, a minimization of
$\bar{D}(\mathbf{X} \| \mathbf{X}'^{\mu}_g)$ can be stated as a maximization of $\bar{R}(\mathbf{Y}'_g)$, since
the redundancy rate of $\mathbf{X}$ is independent of $g$. A direct
consequence is that the output process $\mathbf{Y}_{g^{\circ}}$ contains as little
information as possible.

To this end, note that $H(Y'_g) = H(Y_g)$ since $\mathbf{Y}_g$ and $\mathbf{Y}'_g$
have the same marginal distribution. However,

$$\bar{H}(\mathbf{Y}'_g) = H(Y_{g,1} | Y_{g,0}) \ge \bar{H}(\mathbf{Y}_g) \tag{48}$$

by the fact that conditioning reduces entropy [11, Thm. 2.6.5].
With Definition 1, a maximization of $\bar{R}(\mathbf{Y}'_g)$ minimizes an
upper bound on the entropy rate $\bar{H}(\mathbf{Y}_g)$ of the projected
process $\mathbf{Y}_g$ while simultaneously maximizing its marginal
entropy. Since the entropy rate of a process is a measure of
the average amount of information the process conveys in each
time step, the solution to the above minimization problem yields a
function $g^{\circ}$ with as little information at its output as possible.
In other words, one tries to maximize information loss. This is
in line with the reasoning in [4], where the optimal solution
was characterized as the model that is most "predictable".



Despite this counter-intuitive cost function, the
model reduction method proposed by [4] works and, given
some assumptions on the eigenstructure of the transition
matrix^4, has a justification in spectral theory. In particular,
the bi-partition problem is sub-optimally solved by employing
the sign structure of the second eigenvector, the Fiedler vector.
The authors of [4] then solve the model reduction to an alphabet
of cardinality $\mathrm{card}(\mathcal{Y}) = M$ by recursively applying the
bi-partition problem, i.e., by refining the partitions iteratively.

B. P-lifting and Information Loss 

We now pay the same attention to analyzing the meaning
of the P-lifting. We need

Definition 9 (Relevant Information Loss [20]). Let $X$ be an
RV with finite alphabet $\mathcal{X}$, and let $Y = g(X)$. Let $S$ be another
RV with alphabet $\mathcal{S}$ representing relevant information. The
information loss relevant w.r.t. $S$ is

$$L_S(X \to Y) := I(S; X) - I(S; Y) = I(X; S | Y). \tag{49}$$

We now make a connection between Definition 9 and the
KLDR between $\mathbf{X}$ and the P-lifted chain $\hat{\mathbf{X}}^P_g$. To this end,
let

$$g' := \arg\min_{g} \bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g). \tag{50}$$

Recall from (46) that

$$\bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g) = H(Y_{g,n} | Y_{g,n-1}) - H(Y_{g,n} | X_{n-1}) \tag{51}$$

which, by adding and subtracting $H(Y_{g,n})$, can be rewritten as

$$\bar{D}(\mathbf{X} \| \hat{\mathbf{X}}^P_g) = I(Y_{g,n}; X_{n-1}) - I(Y_{g,n}; Y_{g,n-1}) = L_{Y_{g,n}}(X_{n-1} \to Y_{g,n-1}) \tag{52}$$

where $L_{Y_{g,n}}(X_{n-1} \to Y_{g,n-1})$ is the information loss relevant
w.r.t. $Y_{g,n}$ induced by projecting the previous sample $X_{n-1}$
through the function $g$. Finding the optimal function $g'$ thus
amounts to minimizing information loss.
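For very small alphabets, the minimization over partition functions can be carried out by exhaustive search, using the mutual-information form of the cost in (52). The sketch below is purely illustrative and assumes 0-based indices, base-2 logarithms, and hypothetical helper names; the number of candidate functions grows as $M^N$, so this is not a practical method for large state spaces.

```python
import math
from itertools import product

def stationary(P, iters=2000):
    """Invariant distribution by power iteration (assumes a regular chain)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def mutual_information(joint):
    """Mutual information in bits of a joint PMF given as a matrix."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0)

def relevant_information_loss(P, mu, g, M):
    """I(Y_n; X_{n-1}) - I(Y_n; Y_{n-1}), cf. (52)."""
    N = len(P)
    # joint PMF of (X_{n-1}, Y_n)
    joint_xy = [[sum(mu[i] * P[i][j] for j in range(N) if g[j] == l)
                 for l in range(M)] for i in range(N)]
    # joint PMF of (Y_{n-1}, Y_n)
    joint_yy = [[sum(joint_xy[i][l] for i in range(N) if g[i] == k)
                 for l in range(M)] for k in range(M)]
    return mutual_information(joint_xy) - mutual_information(joint_yy)

def best_partition(P, M):
    """Exhaustive search over all surjective g; returns (loss, g)."""
    mu = stationary(P)
    candidates = (g for g in product(range(M), repeat=len(P)) if len(set(g)) == M)
    return min((relevant_information_loss(P, mu, g, M), g) for g in candidates)

# Hypothetical example; merging states 1 and 2 is a lumpable partition.
P = [[0.5, 0.25, 0.25],
     [0.3, 0.30, 0.40],
     [0.3, 0.60, 0.10]]
best_loss, best_g = best_partition(P, 2)
```

For this example the search returns the lumpable partition (up to relabeling of the two classes) with a loss that vanishes up to rounding.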

To date, we could not verify whether this cost function
has an interpretation in spectral theory. However, as mentioned
above, it minimizes the difference between the first-order upper
and lower bounds on the entropy rate of the projected process
$\mathbf{Y}_g$. Upper and lower bounds are equal only if the original
process is lumpable w.r.t. $g$ [21]. Minimizing this cost function
thus amounts to making the projected process as Markov as possible.

VII. The Information Bottleneck Method: A 
Possible Way to Model Reduction 

We now make a connection between the model reduction 
problem and a well-known information-theoretic algorithm: 
the information bottleneck method |8|. Since in ||2p| the 
information bottleneck (IB) method was reformulated in terms 
of relevant information loss, the results of Section IVI-BI will 
be essential for the development of the following paragraphs. 

To be precise, on the additive reversibilization of it.

Let X be a discrete RV corresponding to an observation, a
signal, or a data set. We are now interested in a compressed
representation Y of this RV, i.e., in the output of a compression
channel. In rate-distortion theory (e.g., [12]) one pursues the
goal to minimize the mutual information between the signal X
and its compression Y, $I(X;Y)$, subject to satisfying a certain
distortion criterion d (e.g., the mean-squared reconstruction
error). This can be cast as a variational problem

$$ \min_{P_{Y|X}} \; I(X;Y) + \beta\, d(X,Y) \tag{53} $$

where $\beta$ is a Lagrange multiplier, and where stochastic
compressions $P_{Y|X}(x,y) = \Pr(Y = y \mid X = x)$ are permitted.

The IB method takes up this approach by replacing the
distortion measure by the negative mutual information between
the compressed RV Y and a relevant RV S, representing the
information one considers as meaningful and one wants to
preserve. The IB method therefore tries to solve

$$ \min_{P_{Y|X}} \; I(X;Y) - \beta\, I(S;Y) \tag{54} $$

where the minimization runs over all stochastic relationships
and where $\beta$ trades compression against preservation of
information: A large value of $\beta$ places emphasis on preservation
of relevant information, while a small value leads to high
compression. Typical applications of the IB method include word
and document clustering [22], [23] and speech processing [24],
[25].
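To make the trade-off in (54) concrete, the following Python sketch evaluates the IB functional for a given joint distribution of (X, S) and a given stochastic compression. It assumes the Markov chain S - X - Y, i.e., that Y depends on S only through X; the function names and the toy numbers are illustrative and not taken from the paper.

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0)

def ib_functional(joint_xs, p_y_given_x, beta):
    """Evaluate I(X;Y) - beta * I(S;Y) for the compression p_y_given_x[x][y]."""
    nx, ns, ny = len(joint_xs), len(joint_xs[0]), len(p_y_given_x[0])
    px = [sum(row) for row in joint_xs]
    joint_xy = [[px[x] * p_y_given_x[x][y] for y in range(ny)] for x in range(nx)]
    # p(s, y) = sum_x p(x, s) p(y | x), using the assumed chain S - X - Y
    joint_sy = [[sum(joint_xs[x][s] * p_y_given_x[x][y] for x in range(nx))
                 for y in range(ny)] for s in range(ns)]
    return mutual_information(joint_xy) - beta * mutual_information(joint_sy)

# Toy joint pmf of (X, S): rows index x, columns index s (hypothetical numbers).
joint_xs = [[0.4, 0.1],
            [0.1, 0.4]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # Y = X: no compression
constant = [[1.0], [1.0]]             # Y constant: maximal compression

for beta in (0.5, 2.0, 10.0):
    print(beta, ib_functional(joint_xs, identity, beta),
          ib_functional(joint_xs, constant, beta))
```

For the constant compression the functional is zero for every $\beta$, whereas the identity compression only becomes preferable once $\beta$ is large enough for the preserved relevant information to outweigh the cost $I(X;Y)$.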

With $\beta \to \infty$,

$$ I(S;Y) = I(S;X) - L_S(X \to Y) \tag{55} $$

and with the restriction to deterministic compressions $P_{Y|X}$
determined by functions $g\colon \mathcal{X} \to \mathcal{Y}$, one obtains a formulation
of the IB method which minimizes the relevant information
loss, i.e., which solves

$$ \min_{g\colon \mathcal{X} \to \mathcal{Y}} L_S(X \to Y). \tag{56} $$
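The objective in (56) can be evaluated directly from (55) as $L_S(X \to Y) = I(S;X) - I(S;Y)$. The sketch below does this for a deterministic compression Y = g(X); the function names and the two toy joint distributions are assumptions made for illustration only.

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0)

def relevant_information_loss(joint_sx, g, M):
    """L_S(X -> Y) = I(S;X) - I(S;Y) for Y = g(X); g[x] is the block index of x."""
    joint_sy = [[sum(p for x, p in enumerate(row) if g[x] == y) for y in range(M)]
                for row in joint_sx]
    return mutual_information(joint_sx) - mutual_information(joint_sy)

# Case 1: S = X uniform on four symbols; merging pairs of symbols loses one bit.
diag = [[0.25 if s == x else 0.0 for x in range(4)] for s in range(4)]
print(relevant_information_loss(diag, [0, 0, 1, 1], 2))

# Case 2: symbols 0 and 1 of X carry the same information about S,
# so merging them loses (numerically almost) nothing.
joint_sx = [[0.20, 0.20, 0.10],
            [0.05, 0.05, 0.40]]
print(relevant_information_loss(joint_sx, [0, 0, 1], 2))
```

The second case mirrors the intuition behind lumpability: states with identical conditional behaviour can be merged without relevant information loss.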



For this problem, an iterative procedure called
agglomerative IB (AIB) was introduced in [23], which successively
merges two elements of a partition of $\mathcal{X}$ until the desired
cardinality M is reached. Although the two elements being merged
are chosen such that the reduction of relevant information is
minimized, the method is only "locally optimal" [23] and thus
will in general induce a relevant information loss at least as
large as that of a direct solution of (56).

Comparing (56) with (52) one can see that $Y_{g,n}$ takes the
place of the relevant information S and thus depends on g,
i.e., on the object to be optimized. Since in such a case the
IB method is not applicable directly, we apply [20, Cor. 1]:

$$ L_{Y_{g,n}}(X_{n-1} \to Y_{g,n-1}) \le L_{X_n}(X_{n-1} \to Y_{g,n-1}). \tag{57} $$

In order to apply the (agglomerative) IB method, the problem
has to be relaxed: Instead of minimizing $\bar{D}(X \,\|\, X'^{P}_{g})$, we only
minimize its upper bound given by (57). We thus look for

$$ g^{\bullet} := \arg\min_{g} L_{X_n}(X_{n-1} \to Y_{g,n-1}). \tag{58} $$
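A minimal sketch of how a greedy, agglomerative search for the relaxed problem (58) might look is given below. It is not the implementation of [23] or of any particular toolbox: the function names are hypothetical, the invariant distribution is approximated by power iteration (assuming ergodicity), and every candidate merge is scored by recomputing the relaxed cost $H(X_n \mid Y_{g,n-1}) - H(X_n \mid X_{n-1})$ from scratch, which is simple but not efficient.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * log2(q) for q in p if q > 0)

def stationary(P, iters=20000):
    """Approximate the invariant distribution by power iteration (assumes ergodicity)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def relaxed_cost(P, mu, blocks):
    """L_{X_n}(X_{n-1} -> Y_{n-1}) in bits for a partition given as a list of blocks."""
    n = len(P)
    h_given_x = sum(mu[i] * entropy(P[i]) for i in range(n))
    h_given_y = 0.0
    for block in blocks:
        w = sum(mu[i] for i in block)
        # Distribution of X_n given that X_{n-1} lies in this block.
        mix = [sum(mu[i] * P[i][j] for i in block) / w for j in range(n)]
        h_given_y += w * entropy(mix)
    return h_given_y - h_given_x

def agglomerative_merge(P, M):
    """Greedily merge two blocks at a time until M blocks remain."""
    mu = stationary(P)
    blocks = [[i] for i in range(len(P))]
    while len(blocks) > M:
        best_cost, best_blocks = None, None
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                rest = [blk for k, blk in enumerate(blocks) if k not in (a, b)]
                candidate = rest + [blocks[a] + blocks[b]]
                cost = relaxed_cost(P, mu, candidate)
                if best_cost is None or cost < best_cost:
                    best_cost, best_blocks = cost, candidate
        blocks = best_blocks
    return sorted(sorted(b) for b in blocks)

# Transition matrix (67) of Example 1, states indexed from 0.
P = [[0.97, 0.01, 0.02],
     [0.02, 0.48, 0.50],
     [0.01, 0.75, 0.24]]
print(agglomerative_merge(P, 2))
```

Because each step commits to the locally best merge, the result for M much smaller than N need not coincide with the global minimizer of (58).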

We note that a connection between Markov chains and
the IB method was already made in [26]. There, the authors
assumed a data clustering problem where the data is given as
a matrix of pairwise distances. They then proposed to generate
a Markov transition matrix based on these pairwise distances
before letting the process relax to a quasi-stable point (in terms
of the row structure of the powers of the transition matrix).
Applying the IB method at this point yields a clustering of
the dataset, despite the absence of a probability distribution
describing it. The authors of [26] even note that "two [...]
points become similar when they reach the other states with
similar probabilities following partial relaxation", a condition
similar to the requirement for lumpability. However, the
connection to lumpability and, thus, to order reduction of Markov
models has not been made by the authors.

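A. Sub-Optimality of AIB

By relaxing the optimization problem to (58) one loses the
property that the cost function minimizes the upper bound on
$\bar{D}(Y_g \,\|\, Y'_g)$. However, as we show immediately, the obtained
upper bound is still better than $\bar{D}(X \,\|\, X'^{\mu}_{g})$:

$$
\begin{aligned}
& L_{X_n}(X_{n-1} \to Y_{g,n-1}) && (59) \\
&= H(X_n \mid Y_{g,n-1}) - H(X_n \mid X_{n-1}) && (60) \\
&= H(X_n, Y_{g,n} \mid Y_{g,n-1}) - \bar{H}(X) && (61) \\
&= H(X_n \mid Y_{g,n}, Y_{g,n-1}) + \underbrace{H(Y_{g,n} \mid Y_{g,n-1})}_{=\bar{H}(Y'_g)} - \bar{H}(X) && (62) \\
&\le H(X_n \mid Y_{g,n}) + \bar{H}(Y'_g) - \bar{H}(X) && (63) \\
&= H(X) - H(Y_g) + \bar{H}(Y'_g) - \bar{H}(X) && (64) \\
&= \bar{R}(X) - \bar{R}(Y'_g) && (65) \\
&= \bar{D}(X \,\|\, X'^{\mu}_{g}) && (66)
\end{aligned}
$$

where the last line is due to [4, Lem. 3].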
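The chain of bounds formed by (57) and (59)-(66) can be checked numerically. The sketch below computes, for one partition, the P-lifting cost via (51), the relaxed cost via (60), and the right-hand side of (64); the function names are illustrative, the invariant distribution is approximated by power iteration (assuming ergodicity), and the matrix is the one of Example 1.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * log2(q) for q in p if q > 0)

def stationary(P, iters=20000):
    """Approximate the invariant distribution by power iteration (assumes ergodicity)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def three_costs(P, g, M):
    """Return (cost (51), relaxed cost (60), bound (64)) in bits; g[i] is the block of i."""
    n = len(P)
    mu = stationary(P)
    nu = [sum(mu[i] for i in range(n) if g[i] == h) for h in range(M)]
    # Row h: distribution of X_n given Y_{n-1} = h.
    mix = [[sum(mu[i] * P[i][j] for i in range(n) if g[i] == h) / nu[h]
            for j in range(n)] for h in range(M)]
    # Row i: distribution of Y_n given X_{n-1} = i.
    R = [[sum(P[i][j] for j in range(n) if g[j] == l) for l in range(M)]
         for i in range(n)]
    # Row h: distribution of Y_n given Y_{n-1} = h.
    Q = [[sum(mix[h][j] for j in range(n) if g[j] == l) for l in range(M)]
         for h in range(M)]
    rate_x = sum(mu[i] * entropy(P[i]) for i in range(n))   # entropy rate of X
    rate_y = sum(nu[h] * entropy(Q[h]) for h in range(M))   # H(Y_n | Y_{n-1})
    d_p = rate_y - sum(mu[i] * entropy(R[i]) for i in range(n))
    relaxed = sum(nu[h] * entropy(mix[h]) for h in range(M)) - rate_x
    d_mu = entropy(mu) - entropy(nu) + rate_y - rate_x
    return d_p, relaxed, d_mu

P = [[0.97, 0.01, 0.02],
     [0.02, 0.48, 0.50],
     [0.01, 0.75, 0.24]]
for g in ([0, 0, 1], [0, 1, 0], [0, 1, 1]):
    print(g, [round(v, 3) for v in three_costs(P, g, 2)])
```

For each partition the three numbers should be non-decreasing from left to right, reflecting (57) followed by (59)-(66).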

Aside from AIB being suboptimal compared to the IB
method, the solution of the relaxed problem might not coincide
with the solution of the original problem. To be specific: Even
if a Markov chain X is lumpable, neither the AIB nor the
IB method implementing the relaxed optimization problem
necessarily finds the optimal M-partition. We will elaborate
on this topic in one of the examples in Section VIII-C.

VIII. Examples 

In this section we illustrate our theoretical results by
means of a few examples. In particular, we show the
applicability of the information bottleneck method for Markov chain
aggregation in Section VIII-B.



A. Example 1

We take the matrix given in [4, Section V.A],

$$ P = \begin{bmatrix} 0.97 & 0.01 & 0.02 \\ 0.02 & 0.48 & 0.50 \\ 0.01 & 0.75 & 0.24 \end{bmatrix} \tag{67} $$

and use three different functions g inducing partitions of the
alphabet $\mathcal{X}$: {{1, 2}, {3}}, {{1, 3}, {2}}, and {{1}, {2, 3}}.

For all the resulting aggregations we compute upper bounds
on the aggregation error using both the μ-lifting and the
P-lifting. In addition to that, the invariant distributions of the
P-lifted Markov chains $X'^{P}$ are computed and compared to μ,
the invariant distribution of the original chain X. The results
are shown in Table I.

TABLE I
Results of Example 1

Partition        KLDR (μ)      KLDR (P)      Invariant distribution
                 bit/sample    bit/sample    of the P-lifted chain
{{1,2}, {3}}     0.823         0.185         [0.077, 0.658, 0.265]
{{1,3}, {2}}     0.808         0.317         [0.065, 0.388, 0.546]
{{1}, {2,3}}     0.037         0.001         [0.347, 0.388, 0.265]

Invariant distribution of the original chain: μ = [0.347, 0.388, 0.265].

As can be seen, the partition {{1}, {2, 3}} yields the best
results in terms of KLDR. Moreover, the KLDR using P-lifting
is smaller than the KLDR using μ-lifting in all three cases, as
suggested by Theorem 1. However, while the μ-lifting yields the
same invariant distribution as the original chain, our method
obtains quite different values. An exception is the optimal
partition, where $Y_g$ and $Y'_g$ are very close in terms of the KLDR.

B. Example 2

In this example we applied the transition matrix P from [4,
Fig. 7] and used the agglomerative information bottleneck
method [23] to aggregate the chain as described in
Section VII. As can be seen in Fig. 3, the partitions of the
alphabet appear to be reasonable and, for M = 5, coincide
with the solution obtained in [4]. In essence, the aggregation
reduces the alphabet to groups of strongly interacting states.

An interesting fact can be observed by looking at Fig. 2,
which compares the KLDR curves for both lifting methods
(the aggregation was obtained using the agglomerative IB
method in both cases). While using π-lifting the KLDR
seems to be a function decreasing with increasing M, the
same does not hold for the P-lifting: If a certain partition is
"nearly" lumpable, the KLDR curve exhibits a local minimum.
Trivially, global minima with value zero are obtained for
M = 1 and M = N.

For π-lifting the number of strongly interacting groups
was claimed to correspond to a change of slope of the
KLDR curve [4, Section V.D]. Utilizing the tighter bound by
employing P-lifting allows one to choose the model order by
detecting local minima, which is simpler than detecting a
change of slope.
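For Example 1, the entries of Table I can be recomputed with a short script. The sketch below builds both lifted transition matrices and evaluates the KLDR directly as $\sum_{i,j} \mu_i P_{ij} \log_2 (P_{ij}/P'_{ij})$. It rests on two assumptions about definitions given earlier in the paper: the μ-lifting is taken as $P'_{ij} = (\mu_j / \sum_{k \in S_j} \mu_k)\, Q_{g(i)g(j)}$ and the P-lifting as $P'_{ij} = (P_{ij} / \sum_{k \in S_j} P_{ik})\, Q_{g(i)g(j)}$, with $S_j$ the block containing j. All function names are illustrative.

```python
from math import log2

P = [[0.97, 0.01, 0.02],
     [0.02, 0.48, 0.50],
     [0.01, 0.75, 0.24]]

def stationary(P, iters=20000):
    """Approximate the invariant distribution by power iteration (assumes ergodicity)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def aggregate(P, mu, g, M):
    """Invariant distribution nu and transition matrix Q of the aggregated chain."""
    n = len(P)
    nu = [sum(mu[i] for i in range(n) if g[i] == h) for h in range(M)]
    Q = [[sum(mu[i] * P[i][j] for i in range(n) for j in range(n)
              if g[i] == h and g[j] == l) / nu[h] for l in range(M)]
         for h in range(M)]
    return nu, Q

def mu_lifting(P, mu, g, M):
    nu, Q = aggregate(P, mu, g, M)
    n = len(P)
    return [[mu[j] / nu[g[j]] * Q[g[i]][g[j]] for j in range(n)] for i in range(n)]

def p_lifting(P, mu, g, M):
    _, Q = aggregate(P, mu, g, M)
    n = len(P)
    def block_mass(i, j):  # probability of moving from i into the block of j
        return sum(P[i][k] for k in range(n) if g[k] == g[j])
    return [[P[i][j] / block_mass(i, j) * Q[g[i]][g[j]] if P[i][j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

def kldr(P, mu, P_lifted):
    """KLDR in bit/sample between the original chain and a lifted chain."""
    n = len(P)
    return sum(mu[i] * P[i][j] * log2(P[i][j] / P_lifted[i][j])
               for i in range(n) for j in range(n) if P[i][j] > 0)

mu = stationary(P)
for g in ([0, 0, 1], [0, 1, 0], [0, 1, 1]):   # blocks of states 1, 2, 3
    lifted_mu, lifted_p = mu_lifting(P, mu, g, 2), p_lifting(P, mu, g, 2)
    print(g, round(kldr(P, mu, lifted_mu), 3), round(kldr(P, mu, lifted_p), 3),
          [round(x, 3) for x in stationary(lifted_p)])
```

Under these assumptions the printed rows should agree with Table I up to rounding in the last digit.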



C. Example 3 

In this example we show that the relaxed optimization
problem does not necessarily find the optimal partition. We start
with a Markov chain X with a state space $\mathcal{X}$ with cardinality
N = 3 and look at the bi-partition problem (i.e., M = 2). Let
the transition matrix be given as



(68) 



We used the VLFeat Matlab implementation [27] of the agglomerative IB
method.

Fig. 3. An illustration of Example 2: The original transition matrix (a) and the transition matrices obtained by aggregation using the agglomerative information
bottleneck method. Blocks of the same color indicate that the corresponding states are mapped to the same output. M = 3 (b) and M = 5 (c).

Fig. 2. KLDR for the μ- and the P-lifting for different cardinalities M
of the aggregated chain's alphabet. Both curves were obtained using the
agglomerative IB method.



Since this chain is lumpable for the partition {{1, 2}, {3}}
(induced by the optimal function $g^{\circ}$), one obtains

$$ L_{Y_{g^{\circ},n}}(X_{n-1} \to Y_{g^{\circ},n-1}) = 0. \tag{69} $$

Computer simulations show, however, that this partition leads
to a larger value of $H(X_n \mid Y_{g^{\circ},n-1})$ than the other two options
(namely, 1.19 bit compared to 0.55 and 0.69 bit, respectively).
Since here the AIB and IB methods coincide*, this example
shows that the relaxation of the optimization problem does not
necessarily lead to the optimal partition.



D. Example 4 

We finally take an example from [3, pp. 139], which shows
that our upper bound on the aggregation error is not tight in
general, but only for strongly lumpable X. To this end, let



(70) 



*The bi-partition is obtained by merging two states.

As can be verified easily, this chain is lumpable w.r.t. the
partition {{1}, {2, 3}}, but not strongly lumpable; i.e., the
matrix fulfils (22) but not (21). To show that the bound is
not tight, we observe that $\bar{D}(Y_g \,\|\, Y'_g) = 0$ but that, with



± I & 

1 1 n 

12 12 ^ 



T^P 



(71) 



--- 0.01 we get D{X\\X"') = 0.347 > 0. 



IX. Conclusion 

In this work we presented a new method for Markov chain 
model order reduction based on information-theoretic criteria. 
Specifically, the Kullback-Leibler divergence rate between the 
process obtained by simply partitioning the alphabet of the 
original chain and its best Markov approximation is employed 
as a cost function. The Kullback-Leibler divergence rate be- 
tween the original chain and the lifting of the optimal Markov 
approximation was shown to yield an upper bound on the cost 
function. 

By properly defining the lifting, we not only obtain the
best upper bound under certain restrictions, but also a cost
function which links the reduced-order model to the notion of
lumpability. In addition to that, it is shown that the information 
bottleneck method can be used for model reduction by relax- 
ing the optimization problem. Future work shall investigate 
possible connections between the proposed cost function and 
the spectral theory of Markov chains. 

Appendix A
Proof of Proposition 1

For the first property (which is also mentioned in [4]) we
show that the condition

$$ V U P' V = P' V \tag{72} $$

from Lemma 1 holds for all possible liftings. Letting
$P' = V Q U^{\pi}$ we can continue

$$ V U V Q U^{\pi} V = V Q U^{\pi} V. \tag{73} $$

Since $U^{\pi} V = I$ regardless of the probability vector $\pi$,
equality is achieved and the first result is proved. We note
in passing that this actually proves strong lumpability of $X'^{\pi}$
(cf. [3]).

For the second property (cf. [4, Thm. 2, Property 3]) note
that with $P' = V Q U^{\mu}$,

$$ \mu^T P' = \mu^T V Q U^{\mu} = \nu^T Q U^{\mu} = \nu^T U^{\mu} = \mu^T \tag{74} $$

where the second and third equality are due to Lemma 1 and
the last follows from the definition of $U^{\mu}$ in (20).

For the third property we refer the reader to [S3 Thm. 1].

Next, observe that $Q_{g(i)g(j)} = 0$ implies $P_{ij} = 0$ (see (34),
combined with the fact that $\mu$ is positive [3, Thm. 4.1.4]). For
the entries of the lifted matrix we can now write

$$ P'_{ij} = \frac{\pi_j}{\sum_{k \in S_j} \pi_k}\, Q_{g(i)g(j)}. \tag{75} $$

Since $\pi$ is positive, it follows that $P'_{ij} = 0$ implies $P_{ij} = 0$,
or equivalently, $P' \gg P$.

The last property immediately follows from Lemma 2;
the lemma may be applied because $X'^{\pi}$ is lumpable to $Y'$
(property 1) and $P' \gg P$ (property 4). This completes the
proof. ■

Appendix B
Proof of Theorem 1

For the proof we note that another condition for lumpability
is given by the entries of the matrix $R = PV$. In particular,
if and only if for all $h, l \in \mathcal{Y}$ the elements

$$ R_{il} = \sum_{j \in g^{-1}(l)} P_{ij} \tag{76} $$

are the same for all $i \in g^{-1}(h)$, the chain is lumpable w.r.t.
g [11]. Using this with (39) one gets

$$
\begin{aligned}
R'_{il} &= \sum_{j \in g^{-1}(l)} P'_{ij} && (77) \\
&= Q_{g(i)l} \sum_{j \in g^{-1}(l)} \frac{P_{ij}}{\sum_{k \in S_j} P_{ik}} && (78) \\
&= Q_{g(i)l} && (79)
\end{aligned}
$$

since in this case $S_j = g^{-1}(l)$. Clearly, $R'_{il}$ assumes the same
values for all $i \in g^{-1}(h)$, as required. This completes the proof
of the first statement. Again, we proved strong lumpability in
the sense of [3].

The second statement is obvious from the definition of
P-lifting and from the proof of property 4 of Proposition 1.

For the third statement, we introduce an arbitrary lifting

$$ \tilde{P}_{ij} = b_{ij}\, Q_{g(i)g(j)} \tag{80} $$

subject to $\sum_{j \in g^{-1}(l)} b_{ij} = 1$ for all $l \in \mathcal{Y}$ and all $i \in \mathcal{X}$.
This condition is necessary and sufficient for lumpability of
the lifted chain $\tilde{X}$ with transition matrix $\tilde{P}$. We next write for
the KLDR

$$
\begin{aligned}
\bar{D}(X \,\|\, \tilde{X}) &= \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{P_{ij}}{\tilde{P}_{ij}} && (81) \\
&= \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{P_{ij}}{b_{ij}\, Q_{g(i)g(j)}} && (82) \\
&= \bar{H}(Y'_g) - \bar{H}(X) + \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{1}{b_{ij}}. && (83)
\end{aligned}
$$

We now split the last term as

$$ \sum_{i,j \in \mathcal{X}} \mu_i P_{ij} \log \frac{1}{b_{ij}} = \sum_{i \in \mathcal{X}} \mu_i \sum_{l \in \mathcal{Y}} R_{il} \sum_{j \in g^{-1}(l)} \frac{P_{ij}}{R_{il}} \log \frac{1}{b_{ij}}. $$

Here, the last term on the right is a cross-entropy, since both
$b_{ij}$ and $P_{ij}/R_{il}$ are probability vectors on $g^{-1}(l)$. The
cross-entropy is minimized if and only if for all $j \in g^{-1}(l)$

$$ b_{ij} = \frac{P_{ij}}{R_{il}} = \frac{P_{ij}}{\sum_{k \in g^{-1}(l)} P_{ik}}. \tag{84} $$

Since the sums over i and l are expectations, the minimum
is achieved if and only if the above condition holds also for all
$i \in \mathcal{X}$ and all $l \in \mathcal{Y}$ for which $\mu_i R_{il} > 0$.

To show that the P-lifting indeed yields a better bound,
observe that with $H(Y_{g,n} \mid Y_{g,n-1}) = \bar{H}(Y'_g)$

$$
\begin{aligned}
\bar{D}(X \,\|\, X'^{\mu}) - \bar{D}(X \,\|\, X'^{P})
&= H(X) - \bar{H}(X) - H(Y_g) + H(Y_{g,n} \mid X_{n-1}) \\
&= H(X_n) - H(X_n \mid X_{n-1}) - H(Y_{g,n}) + H(Y_{g,n} \mid X_{n-1}) \\
&= I(X_n; X_{n-1}) - I(Y_{g,n}; X_{n-1}) \ge 0 && (85)
\end{aligned}
$$

by the data processing inequality. $\bar{D}(X \,\|\, X'^{P}) \ge \bar{D}(Y_g \,\|\, Y'_g)$
is obtained by Lemma 2.

For the fifth property, note that the sufficient and necessary
condition for strong lumpability, i.e., for (21), namely that

$$ R_{il} = \sum_{j \in g^{-1}(l)} P_{ij} = Q_{hl} \tag{86} $$

is the same for all $i \in g^{-1}(h)$, can be used in the definition
of $P'$:

$$ P'_{ij} = \frac{P_{ij}}{\sum_{k \in S_j} P_{ik}}\, Q_{g(i)g(j)} = \frac{P_{ij}}{\sum_{k \in S_j} P_{ik}}\, R_{ig(j)} = P_{ij}. \tag{87} $$

This proves the "⇒" part. The "⇐" part follows from
Lemma 2. This completes the proof. ■
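The first and third statements of Theorem 1 lend themselves to a quick numerical sanity check. The sketch below builds liftings of the form (80) from a weight function $b_{ij}$, verifies the lumpability identity (77)-(79), and compares the KLDR of the weights (84) with that of a uniform choice of $b_{ij}$. The matrix is the one of Example 1; the uniform weights, the fallback used when a row puts no mass on a block, and all function names are assumptions made for illustration.

```python
from math import log2

P = [[0.97, 0.01, 0.02],
     [0.02, 0.48, 0.50],
     [0.01, 0.75, 0.24]]
g, M = [0, 1, 1], 2   # partition {{1}, {2, 3}}, indexed from 0

def stationary(P, iters=20000):
    """Approximate the invariant distribution by power iteration (assumes ergodicity)."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def aggregated_matrix(P, mu, g, M):
    """Transition matrix Q of the aggregated (Markov approximation) chain."""
    n = len(P)
    nu = [sum(mu[i] for i in range(n) if g[i] == h) for h in range(M)]
    return [[sum(mu[i] * P[i][j] for i in range(n) for j in range(n)
                 if g[i] == h and g[j] == l) / nu[h] for l in range(M)]
            for h in range(M)]

def lifting(P, Q, g, weight):
    """Lifting (80): entry (i, j) equals weight(i, j) * Q[g(i)][g(j)]."""
    n = len(P)
    return [[weight(i, j) * Q[g[i]][g[j]] for j in range(n)] for i in range(n)]

def p_weights(P, g):
    """Weights (84): P_ij divided by the mass row i puts on the block of j."""
    n = len(P)
    def b(i, j):
        mass = sum(P[i][k] for k in range(n) if g[k] == g[j])
        # Fallback (assumption): uniform weights if row i never enters the block.
        return P[i][j] / mass if mass > 0 else 1.0 / g.count(g[j])
    return b

def uniform_weights(g):
    """Alternative admissible weights: uniform within each block."""
    return lambda i, j: 1.0 / g.count(g[j])

def kldr(P, mu, lifted):
    n = len(P)
    return sum(mu[i] * P[i][j] * log2(P[i][j] / lifted[i][j])
               for i in range(n) for j in range(n) if P[i][j] > 0)

mu = stationary(P)
Q = aggregated_matrix(P, mu, g, M)
lifted_p = lifting(P, Q, g, p_weights(P, g))
lifted_u = lifting(P, Q, g, uniform_weights(g))

# (77)-(79): summing a lifted row over a block returns the entry of Q.
for i in range(len(P)):
    for l in range(M):
        block_sum = sum(lifted_p[i][j] for j in range(len(P)) if g[j] == l)
        assert abs(block_sum - Q[g[i]][l]) < 1e-12

print(kldr(P, mu, lifted_p), kldr(P, mu, lifted_u))
```

The first printed value should not exceed the second, in line with the cross-entropy argument leading to (84).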
Note that this statement holds for all stochastic matrices used for lifting,
i.e., the lifting matrix does not have to be equal to the transition matrix of
the original chain.

A direct consequence of the fact that the Kullback-Leibler divergence
vanishes if and only if the considered probability mass functions are equal
[11, pp. 31].